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ABSTRACT 

Motivation: The human microbiome plays an important role in 
human disease and health. Identification of factors that affect 
the microbiome composition can provide insights into disease 
mechanism as well as suggest ways to modulate the microbiome 
composition for therapeutical purposes. Distance-based statistical 
tests have been applied to test the association of microbiome 
composition with environmental or biological covariates. The 
unweighted and weighted UniFrac distances are the most widely 
used distance measures. However, these two measures assign too 
much weight either to rare lineages or to most abundant lineages, 
which can lead to loss of power when the important composition 
change occurs in moderately abundant lineages. 
Results: We develop generalized UniFrac distances that extend 
the weighted and unweighted UniFrac distances for detecting a 
much wider range of biologically relevant changes. We evaluate 
the use of generalized UniFrac distances in associating microbiome 
composition with environmental covariates using extensive Monte 
Carlo simulations. Our results show that tests using the unweighted 
and weighted UniFrac distances are less powerful in detecting 
abundance change in moderately abundant lineages. In contrast, 
the generalized UniFrac distance is most powerful in detecting such 
changes, yet it retains nearly all its power for detecting rare and 
highly abundant lineages. The generalized UniFrac distance also has 
an overall better power than the joint use of unweighted/weighted 
UniFrac distances. Application to two real microbiome datasets has 
demonstrated gains in power in testing the associations between 
human microbiome and diet intakes and habitual smoking. 
Availability: http://cran.r-project.org/web/packages/GUniFrac 
Contact: |hongzhe@upenn.edu 



microbiome composition can now be determined by direct DNA 
sequ encing without the need for laborious cultivation fPinsdale 



Supplementary information: [Supplementary data] are available at 
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1 INTRODUCTION 

Understanding the compositional differences of microbial 
communities (microbiomes) is essential in microbial ecology. 
With the development of next-generation sequencing technologies, 

To whom correspondence should be addressed. 



et a/.j2008l:lGill a/.L l2006l: iGrice a/.L l2009l: lOin et a/.Ll201C : 



Tringe et al , 20051 : IXurnbaugh et a/.l 20091 : Ivon Mering e^^l 



20071) . There has been a great interest in huma n microbiome studies 
in diff erent body sites, r anging from skin (iGrice et aA 2009h 
to gut (lArumugam et a/.L l201ll : iMuegge et a/.U201ll: Qin et a/.l 
l2010l : IWu et a/.Ll201lh and respiratory tract (ICharlson et a/.Ll2010^ . 
Important insights have been gained from analysis of large-scale 
human microbionie data , including the discovery of enterotypes 
(lArumugam et a/.L l201l[) and discove ry of the \mk between diet 
and these enterotypes (IWu a/ll201lh . Although the metagenomic 
shotgun approach is potentially more powerful and unbiased, 16S 
rRNA gene targeted sequencing is routinely performed to determine 
the taxonomic composition. The generated 16S rRNA sequence 
tags are usually clustered into operational taxonomic units (OTUs) 
with a specified amount of variability allowe d within each OTU 
(ICaporaso gfaZl l20ld : ISchloss g^ a/l l2009h . At 97% similarity, 
these OTUs represent 'species'. Downstream analysis is then 
performed on the OTU abundance data. 

Two central themes in human microbiome studies are to identify 
potential biological and environmental factors that are associated 
with microbiome composition, and to define the relationship 
between microbiome features and biological or clinical outcomes. 
The goal is to provide a better understanding of the factors that shape 
our microbiome and, potentially, contribute to the development of 
new therapeutic strategies t o modulate th e microbiome composition 
and affect human health ( ISpor et ali I2OIII : IVirgin and Toddl 
I2OIII) . Testing the association of microbiome composition with 
potential environmental factors using OTU abundances directly is 
difficult due to high dimensionality, non-normality and phylogenetic 
structure of the OTU data. Instead, distance-based non-parametric 
testing, in which a distance measure is defined-between any two 
micr obiome samples, is usually used to achieve this goal (^Char lson 



micr obiome samples, is usually used to achieve this goal (L-har lson 
et al^ 2010 :|Fukuyama et a/.Ll2012l: iKuczynski et a/.Ll2010al: Wu 



et a/., l2010Ll201ll) . The power of the distance-based test depends 
on a proper choice of the distance measure. Numerous distance 
measures have been proposed to compare microbial communities 
(iKuczynski et ali l2010bl : ISwensonl l201lh . Phylogenetic distance 
measures, which account for the phylogenetic relationship among 
the species, provide far more power because they exploit the 
degree of divergence between different sequences. Among these, 
the UniFrac distances are the most popular ones (Lozupone and 
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Knight, I2OO5I : iLozupone et all l2007h . There are two versions of 
UniFrac distances: an unweighted UniFrac distance that considers 
only species presence and absence information and counts the 
fraction of branch length unique to either community, and a weighted 
UniFrac distance that uses species abundance information and 
weights the branch length with abundance difference. Unweighted 
UniFrac distance is most efficient in detecting abundance change in 
rare lineages. When the abundance of a rare lineage falls below a 
certain threshold, the sequencing machine may not be able to pick it 
up and it will appear absent in the final dataset. On the other hand, 
weighted UniFrac distance is most sensitive to detect change in 
abundant lineages since it uses absolute abundance difference in its 
definition. However, unweighted/weighted UniFrac distances may 
not be very powerful in detecting change in moderately abundant 
lineages. Recently, a variance-adjusted weighted UniFrac distance 
(VAW-UniFrac), which moderates the branch proportion difference 
by its variance, was developed to account for the fact that weighted 
UniFrac distance does not consider the variation of the weights under 
random sampling dChang etadhonh . VAW-UniFrac was shown to 
increase the power over weighted UniFrac distance for detecting the 
difference between two microbial communities. 

In this article, we introduce generalized UniFrac distances that 
unify weighted UniFrac and unweighted UniFrac distances. The new 
generalized UniFrac distances cover a series of distances ranging 
from weighted to unweighted UniFrac by adjusting the weight on the 
branches. The generalized UniFrac distances are designed to provide 
a robust and powerful tool for detecting a wider range of biologically 
relevant changes in microbiome composition. We conduct extensive 
Monte Carlo simulation studies under various conditions to evaluate 
their power in detecting environme ntal influence on microbiome 
composition using PERMANOVA (iMcArdlel l200lh , a distance- 
based non-parametric test. Although each distance in the series 
can perform the best in certain scenarios, none has the optimal 
performance under all conditions considered. However, analyses 
based on the generalized UniFrac distances are shown to be more 
robust and has overall the best performances across a range of 
possible scenarios. We demonstrate the power gain of using this 
distance in detecting the microbiome differences by analysis of 
two real human gut microbiome datasets related to linking human 
gut microbiome composition to long-term diet (IWu et all I2OI1I) 
and testing upper respirato ry tract microbiome d ifference between 
smokers and non-smokers dCharlson et fl/.L[201Ql) . 

2 METHODS 

2.1 Generalized UniFrac distances between two 
microbial communities 

Consider two microbiome communities A and B and suppose that we have 
a rooted phylogenetic tree with n branches. Let bt be the length of the 
branch / and pf and pf are the taxa proportions descending from the 
branch i for community A and B, respectively. The unique fraction metric, 
or UniFrac, measures the phylogenetic distance between sets of taxa in a 
phylogenetic tree as the fraction of the branch length of the tree that leads to 
descendants from either one environment or the other, but not both. The 
origin al definition refers to unweighted UniFrac iLozupone and KnightL 
I2OO5I) . which is mathematically defined as 



bi\l(pf>0)-l(pf>0)\ 



where /(.) is the indicator function and only presence/absence of species 
of branch /, I(pf > 0) and I(pf > 0), are used in the definition. The distance 
definition completely ignores the taxa abu ndance information. In contrast, 
the (normahzed) weighted UniFrac distance iLozupone et fl/.ll2007l) weights 
the branch length with abundance difference and is defined as 

t^i\pt-pf\ 



i=\ 

Note that cannot be reduced to even if we convert abundance data 
into presence/absence data. Also note that uses the absolute proportion 

difference \pf —pf | in its formulation. The consequence of using the absolute 
difference is that the value of d^ is determined mainly by branches with large 
proportions and is less sensitive to the abundance changes on the branches 
with small proportions. To attenuate the weight on branches with large 
proportions, we may instead use the relative difference \pf — pf \ /(pf -{-pf) 
(e [0, 1]) in the formulation. We denote this distance measure as 



pf- 



pf+pf 



where YTi=\^i the denominator is the normalizing factor so that d^^^ e 
[0,1]. Now if we dichotomize the abundance data using the indication 
function /(.), d^^^ is reduced to d^ . So d^^^ can be seen as the 'weighted 
version' of d^ . Using the relative differences, we place equal emphasis on 
every branch and the distance is not dominated by the branches with large 
proportions, since the relative difference does not depend on the magnitude 
of pf,pf. However, the low- abundance branches may be more noisy and the 
relative difference may amplify such noises. To strike a balance between 
relative difference and absolute difference, we weight the branch length 
both by the relative difference and its importance indicated by the branch 
proportion. We propose the following generalized UniFrac distances 

pf-l 



J2bi(pf+pfr 



pf+pf 



Y^biipf+pfr 

where of e [0, 1] controls the contribution from high- abundance branches, and 
Y!i=\bi(pf+pfT is the normalizing factor so that J^"^e[0,l]. Branches 
with zero proportions for both communities will not be included in the 
calculation. As a changes from 0 to 1, more emphasis is placed on high- 
abundance branches. When a = 1, ^/^"^ is reduced to d^ . When a = 0, we get 
d^^^ defined above. 

Therefore, by varying a from 1 to 0 , we achieve a series of distances 
ranging from d^ to d^^\ Note that d^ is obtained by dichotomizing the 
abundance in d^'^\ but is different from d^'^\ We are particularly interested 
in d^^-^\ the distance in the middle of the distance series 

pf-pf 



ENpf+pf 

j(0.5)^i^l 



P? +Pf 



E^-v^H^ 

z=l 

We also compare d^ , d^^-^^ , J^^) and d^ with VAW-UniFrac distance d^^^ 
iChang a/ll201lh . which is defined as: 



E^. 



\pf-pf\ 

m(m — mi) 



m{m — Mi) 
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where m/ is the total number of individuals/reads from both communities on 
the ith branch and m is total number of individuals/reads. 

2.2 Statistical test based on UniFrac distances 

We study the power of generalized UniFrac distances using the distance- 
based non-parametric test for association of microbiome composition with 
environmental covariates. Suppose we have a set of m environmental 
covariates. We assume that we have collected microbiome data and the m- 
dimension al covariates dat a X on ^ samples. We apply the PERMANOVA 
procedure iMcArdlelEoQll) [Permutational Multivariate Analysis of Variance 
Usin g Distance Matrices, 'adonis' function from R package 'vegan' ( Oksanen 
et al, l201lb l. which partitions the distance matrix among sources of 
variation, fits hnear models to distance matrices and uses a permutation 
test with pseudo-F ratios to obtain the P- values. The pseudo-F statistic is 
defined as 

tr(HGH)/(m-l) 
tr[(I-H)G(I-H)]/(^-m)' 

where tr(.) is the trace function of a matrix, H = X(X^X)~^X^ is the hat 
(projection) matrix of the design matrix X, G is Gower's centered matrix and 
n and m is the number of samples and the number of predictors, respectively. 
Let dij be the generalized UniFrac distance between community / and j and 
denote A = (aij) = (—^dfj). The Gower's matrix is defined as 



where 1 is a vector of I's. 

Since d^ and d^ reflect the abundance change in either rare lineages 
or abundant lineages, combining d^ and d^ may potentially increase the 
overall power. Instead of applying Bonferroni correction to the P-values 
from separate PERMANOVA tests using d^ or d^ to control the family- 
wise Type I error rate, a more powerful approach is to take the maximum of 
pseudo-F statistics for d^ and d^ as a new test statistic. The significance of 
the pseudo-F statistics is assessed based on permutations. 

2.3 Simulation strategies 

We use two simulation strategies to evaluate the power of the generalized 
UniFrac distances under various conditions. T he first strateg y is a 
modification of the simulation method proposed bv lSchlossI J2008h . where 
we draw points (16S rRNA sequences) from a 2D circle with known densities 
(Fig. Q]V). This strategy facilitates simulations of different community 
characteristics such as species evenness and species richness. The Euclidean 
distance between points is analogous of the genetic distance between the 
sequences. The diameter of the circle represents the maximum genetic 
divergence between any pair of sequences within a sample. The area of 
the circle is proportional to the richness and the density distribution of the 
circle is proportional to the evenness. By varying the centroid positions (o) 
and their radius (r), it is possible to vary the fraction of shared membership 
and species richness within each sample (Fig. [T^ and D). By varying the 
point distribution on the circle (density proportional to r", where a controls 
the degree of evenness and a = 0.5 for uniformly distribution), it is possible 
to change the species evenness (Fig.^Z!). We also simulate scenarios where 
lineages of different abundance levels change by ak fold (Fig.[Tf-G). These 
are achieved by simulating the community with point mass concentrated at 
the circle center (r^ *^) and varying the point density in different regions of 
the 2D circle corresponding to abundant hneages (0—0. 2r from the center; 
Fig.[Tf ), moderately abundant lineages (0.4r— 0.8r from the center; Fig.[lf ) 
and rare lineages (0.8r— l.Or from the center; Fig. [TJj). We further bin 
the sampled points into small hexagons as 'OTU's before calc ulating the 
UniF rac distance ['hexbin' function from the R package 'hexbin' iCarr et alX 
l201ll) 1. The phylogenetic tree of these 'OTU's is built using NJ algorithm 
(Neighbor Joining, 'nj ' function in R) and rooted by midpoint rooting method. 
Generalized UniFrac distances are then calculated based on the NJ tree and 
'OTU' abundances. Each replication consists of drawing 400 points from 



each community, a bin size of 0.015 units to form 'OTUs' (~300 OTUs 
per sample), and the maximum distance between any two points is 0.3 units 
(r = 0.15), corresponding to typical phylum level divergence of 30% for 16S 
rRNA gene. These conditions allow us to simulate the sampling intensity 
and biodive rsity found with in a typical 16S rRNA gene targeted sequencing 
experiment dSchlosslEoOSl) . 

The second set of simulations utilize a real upper respiratory tract 
micr obiome dataset consisting of 60 samples and 856 OTUs from Charlson 
et al. <2010h (Fig. [Qj). A common way of modeling multivariate count 
data is to use the multinomial model. However, the multinomial model 
assumes fixed underlying proportions for each sample, which do not hold 
for real microbiome data due to high degree of heterogeneity among the 
samples. The real OTU count distribution jSupplementary Fig. SI A) exhibits 
more variance than expected from a multinomial model (Supplementary 
Fig. SIB). To reahstically simulate the data, it is important to model extra- 
variation or overdispersion of the OTU counts. This can be achi eved by 
using the Dirichlet-multinomial (DM) model iMosimanni Il962h . which 
assumes the underlying proportions of the multinomial model come from 
a Dirichlet distribution. The density function of a DM random variable N is 
given as 

k 



P{N = n) = 



n\j=\r=\ 



WW{nj{\-6) + {r-m 



W{l-6 + {r-m 



where n = ^jnj is total count, k is the OTU number and proportion mean 
71 = (tii , 712 , • • • , 71/:) and dispersion 6 are parameters. When ^ = 0, it is reduced 
to multinomial model. We estimate the DM parameters 71, 6 using maximum 
likelihood method ('dirmult' function from R package 'dirmult'). We then 
generate OTU counts using the DM model with the estimated parameters 
and 1000 counts per sample. [Supplementary Figure SlCj shows an OTU 
heatmap generated by the DM model, in which the overdispersion is 
similar to that of the real data. To study the power of UniFrac variants 
for identifying potential environmental factors, we let the abundance of a 
certain OTU cluster change in response to environment. We use the UPGMA 
tree of the OTUs based on the OT U distance matrix c alculated under the 
K80 nucleotide substitution model jFelsensteini |2004 , QIIME (FastTree 
algorithm (?)) and partition the 856 OTUs into 20 clusters using Partitioning 
Around Medoids (PAM) ('pam' function from R package 'cluster') based 
on patristic distances (the length of the shortest path linking two OTUs 
on the tree). These OTU clusters are highlighted in different colors in 
Figure [TJj. 

We call the first strategy 2D circle-based simulation and the second tree- 
based simulation. For power calculation, we use 2000 replications. 

3 RESULTS 

3.1 Comparison of the power of different UniFrac 
variants using 2D circle-based simulations 

We use PERMANOVA to test for environmental effect and 
compare the power of ,d^^-^\d^^\ d^ and d™^ . Specifically, 
we simulate two environmental conditions (e.g. smoking versus 
non-smoking) under which we draw 10 samples each. We then vary 
the degree of community difference under these two conditions and 
produce the power curve over a grid of 10 for each UniFrac distance. 
We investigate six scenarios, where the environmental factor affects 
the community membership, species evenness, species richness, 
most abundant lineages moderately abundant lineages and rare 
lineages (Fig. [T^-G). For each scenario, we vary one community 
characteristic ( [Supplementary Table Sl\ . Suppose xi and X2 are the 
mean values of the community characteristic for Conditions 1 and 
2. We simulate 10 communities for each condition with community 
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Fig. 1. Two simulation strategies to evaluate the generalized UniFrac distances. (A-G), 2D circle-based simulation of microbial communities with different 
characteristics. (A) The microbial community is represented by a 2D circle. Points are drawn from the circle to simulate the 16S-based sampling process. These 
points are further binned into small hexagons as OTUs. UPGMA or NJ method is used to build the OTU phylogenetic tree. Six scenarios are investigated, 
where the difference occurs in: community membership (B), evenness (C), richness (D), most abundant lineages (E), moderately abundant lineages (F) and 
rare lineages (G). The affected lineages are indicated by a red circle or ring. H, tree-based simulation of microbial communities based on the phylogenetic tree 
and DM model. A real OTU phylogenetic tree from a throat microbial community dataset is used. These OTUs are roughly divided into 20 clusters (lineages) 
by performing PAM method using the OTU patristic distance matrix. Each cluster is subjected to abundance change in response to the environment. Counts 
are generated from a DM model. 



characteristic value Xfj ^ Uniform(xy —s,Xj-\-s) for /= 1 ... 10 and 
7 = 1,2, where s controls the variation within each condition and 
takes different values for different scenarios (see Supplementary 
Table SI). Each community is sampled once. Initially, we let 
xi =X2 (no difference) and then increase the difference between xi 
and X2 to simulate stronger environmental effect. PERMANOVA 
is then performed on the distance matrices and the power curve is 
created over a grid of 10 using Type I error a = 0.05. Figure|2]shows 
the power curves for different UniFrac distances under the six 
scenarios considered. When the environmental factor has no effect 
(xi =X2), PERMANOVA controls the Type I error at the nominal 
level of 0.05 for all five UniFrac distances. As the environmental 
effect becomes stronger, all the distances have better power. When 
the environmental factor affects the community membership or 
richness (Panels 1 and 3), all the distances give a similar power and 
their power curves are nearly identical. For the evenness change 
scenario (Panel 2), the power of and d^^'^^ is very close and 
is more powerful than and d^ . d^ is the most powerful for 



detecting change in most abundant lineages (Panel 4) but is much 
less powerful for change in rare lineages (Panel 6). d^ shows an 
opposite trend: it is the most powerful for detecting change in rare 
lineages (Panel 6) but has almost no power for change in most 
abundant lineages (Panel 4). In contrast, d^^-^^ is the most powerful 
for detecting change in moderately abundant lineages (Panel 5). 
They are also the most robust among the distances investigated: its 
power is close to the best UniFrac distance under all scenarios. The 
performance of lies between 

^(0.5) 

and d^ and is also very 
robust. Finally, under the 2D circle simulations, the performance of 
^VAW |g almost identical to that of d^^-^\ 

In the above simulations, we use a bin size of 0.015 to form 
'OTU's ('^300 OTUs per sample). To study the effect of bin size, 
we compare the power curves of the generalized UniFrac distances 
using a smaller bin size of 0.01 (~700 OTUs per sample) or a 
larger bin size of 0.03 ('^SO OTUs per sample). The bin size does 
not change the general conclusion ( [Supplementary Fig. S2^ . To 
study the effect of tree construction methods, we also construct the 
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^ ^ ^(0.5) -v ^(0) ^ + d^^^ 
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2^ 4 6 8 10^ 2 4 6 8 10^ 
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Fig. 2. Power comparison of different UniFrac variants for detecting 
environmental effect using 2D circle-based simulation. PERMANOVA is 
used for testing hypotheses. The specific community difference caused by 
different environmental conditions is indicated in the panel title. The power 
curves are created by varying the degree of environmental effect. The initial 
point of the power curve is the power when there is no environmental effect. 

phylogenetic tree using UPGMA. The general conclusions still hold 
( [Supplementary Fig. S3\ . 



for each condition with their OTU counts generated by the DM 
model with the corresponding proportion vector and the common 
dispersion parameter. As expected, the five UniFrac distances differ 
in their power for detecting environmental effect for the 20 OTU 
clusters tested. Except for , all the UniFrac distances have their 
best-performance scenarios, , d^^-^\ d^ and d^"^^ achieve the 
highest power in seven, six, three and one cases, respectively. For 
the remainins three cases, d and are equally the most 

powerful ( [Supplementary Fig. S4^ . The results are consistent with 
the 2D circle-based simulation: d^ is most powerful in detecting 
the environmental effect on most abundant lineages, d^^'^^ is most 
powerful for moderately abundant lineages and d^ is most powerful 
for rare lineages. In contrast, performance of the test with d^^^ and 
^VAW generally between d^ and d^^-^\ The power of d^ and 
d^ has a reciprocal relationship and neither of them is as robust 
as 

^(0.5) 

. Figure |3}\ shows the power curves of four representative 
cases. As the proportion of the affected cluster decreases from 19.7% 
to 0.9%, d^ becomes less powerful and the power of d^ has the 
opposite trend. 

In the simulations presented above, the power is calculated 
assuming the affected cluster is known. Since the affected cluster can 
be abundant or rare, we randomly choose an affected OTU cluster 
in each replication and calculate the power over 2000 replications. 
We also report the power for the test combining d^ and d^ by 
taking the maximum of their pseudo-F statistics. We denote this 
method as d^"^. Figure[3^ (left plot) shows that d^ has the lowest 
overall power and d^^-^^ and d^"^ have the best power, indicating 
combining d^ and d^ can lead to power gain. In contrast, the power 
is in between. As the environmental effect 
becomes stronger, becomes as powerful as d^^-^^ and d^^^ . 
Finally, we assume that the environmental factor affects a random 
set of 40 OTUs instead of a random OTU cluster. At this extreme 
where the phylogenetic relationship is no longer important, d^^-^^ 
has even higher power than the other distances, followed by 
jMAZ^ J^, d^ and J^^^ (see Fig.EB, right plot). Overall, d^^-^^ 
has a better power than other UniFrac distances including the one 
that combines d^ and d^ . 



3.2 Comparison of the power of different UniFrac 
variants using tree-based simulations 

We also compare the power of different UniFrac distances for 
detecting environmental effect using tree-based simulations that 
mimic the throat microbiome data (see Section 13.41 for details). 
A recently pro p osed variance adjusted UniFrac distance {d^^^) 
( IChang et ali l201ll) is also compared in this setting, d^^^ 
was developed to moderate the branch proportion difference by 
its variance and was shown to increase the power of detecting 
the difference between two microbial communities. We use the 
phylogenetic tree of the 856 OTUs from the throat microbiome 
dataset and divide them into 20 clusters (Fig.[Tfl). The mean OTU 
proportions and the dispersion parameter are estimated from the 
real data by fitting a DM model. We assume that the environmental 
factor causes an increase of the abundance of a particular OTU 
cluster. Specifically, suppose that the proportion of the /th OTU 
cluster under Condition 1 is For Condition 2, the proportion of 
/th OTU cluster is increased by k fold where k varies from 1 (no 
difference) to \l ^/pi (strong effect) on a grid of 10. The proportion 
vector is re-normalized to sum to 1. Next, 10 samples are simulated 



3.3 Results from analysis of a dataset linking long-term 
diet to gut microbiome composition 

Diet strongly affects the human health, p artly by modulating gut 
microbiome composition. I Wu et al\ ( 120111) studied the habitual diet 
effect on the human gut microbiome, where the diet information 
was converted into a vector of micro-nutrient intakes. A total of 
98 healthy volunteers were enrolled in this cross-sectional study. 
Habitual long-term diet information was collected using food 
frequency questionnaire (FFQ). The questionnaires were converted 
to intake amounts of 214 micro-nutrients. Nutrient intake was 
further normalized using the residual method to standardize for 
caloric intake. Stool samples were collected and DNA samples were 
analyzed by 454/Roche pyrosequencing of 16S rRNAgene segments 
of th e VI -¥2 region. The pvroseauences were denoised (^ Ouince 
et g/. J2OO9I) prior to taxonomic assignment yielding an average of 
9265±3864(SD) reads per sample. T he denoised sequences were 
then analyzed by the QIIME pipeline (ICaporaso et all l2010h with 
the default parameter settings in the QIIME pipeline. The OTU 
table contains 3068 OTUs after discarding the singleton OTUs. One 
objective of the study is to identify nutrients that have a significant 
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Fig. 3. Power comparison of different UniFrac variants for detecting 
environmental effect using tree-based simulation. PERMANOVA is used 
for testing hypotheses. The power curves are created by varying the degree 
of environmental effect. (A) The environmental factor affects a particular 
lineage (OTU cluster). Four example lineages of different abundance levels 
that are affected by environment are given. The lineage abundance is given in 
parentheses in the panel title. (B) The environmental factor affects a random 
lineage (left panel) or a random subset of 40 OTUs (right panel). The initial 
point of the power curve is the power when there is no environmental effect. 

impact on the gut microbiome composition. We use PERMANOVA 
to test for association of microbiome composition with nutrient 
intake based on different UniFrac distance matrices. We compare 
j(0-5) with J^, , their combination d^^^ and J^^^. We plot 
the number of selected nutrients against different P-value cutoffs to 
create a ROC-like curve (Fig.|4]). For P < 0.01, all distances except 
the identify the same number of nutrients. For P > 0.03, the curve 
for d^^-^^ is above all the other four curves. Wilcoxon signed-rank 
tests show that d^^-^^ results in smaller P-values than other distances 
(P < 0.05), indicating that d^^-^^ is most powerful in selecting the 
relevant microbiome- associated nutrients. Using d^ or d^ alone 
could miss important associations when the nominal P-value of 
0.03-0.05 is used. Although using a relatively large nominal P- 
value can certainly lead to inclusion of possible false-positive 
nutrient-microbiome associations, there are situations when one 
might want to include all possible associations for further validation 
or replications, d^"^^ performs the second best. Interestingly, d^^^ , 
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Fig. 4. Comparison of different UniFrac variants for detecting nutrient 
effects on gut microbiome composition. PERMANOVA is used for testing 
hypotheses. 214 nutrients are included in the testing. The curves are generated 
by varying the P-value cutoffs. 

the joint use of d^ and d^ , does not increase the power over J^, 
indicating most associations can be recovered by d^ alone. 

3.4 Results from analysis of a throat microbiome 
dataset of smokers and non-smokers 

Cigarette smokers have an increased risk of multiple diseases, 
including upper respiratory tract infections. Previous studies had 
linked smoking to specific respiratory tract bacteria, but the 
consequences of smoking for global airway microbial community 
composition had not been fully clarified. Charlson et al (2010), 
investigated the smoking effect on the oropharyngeal and 
nasopharyngeal bacterial communities using 454 pyrosequencing 
of 16S sequence tags. Specifically, a total of 291 swab samples 
from the right and left nasopharynx and oropharynx of 29 smoking 
and 33 non- smoking healthy asymptomatic adults were collected. 
The variable region 1-2 (V1-V2) of the bacterial 16S rRNA gene 
was PCR-amplified using individually barcoded primer and subject 
to multiplexed pryo sequencing. The pyrosequences were denoised 
(lOuince et al prior to taxonomic assignment and yielded an 

average of 1335±603(SD) reads per airway sample. The denoised 
seque nces were then analvzed using the OIIME pipeline ( Caporaso 
et al. |201Q|) with default parameter setting. We use the left 
oropharyngeal samples in this study. After removing two samples 
with read number < 500 and discarding singleton OTUs, i.e OTU 
with only one read, we finally have an OTU table of 60 samples (28 
smokers versus 32 non-smokers) and 856 OTUs. 

We test the smoking effect on the throat microbial community 
composition by applying PERMANOVA (10000 permutations). All 
the five UniFrac distances achieve statistical significance dX a = 
0.05 level, indicating smoking alters the community composition. 
However, test using d^^-^^ produces the smallest P-value of 0.006, 
followed by 0.008 from d^^\ The P-values based on J^, d^ and 
^VAW 0.012, 0.019 and 0.043, respectively. We also perform 
a principle coordinate analysis using the distance matrices and 
plot the samples on the first two principle coordinates (Fig. [5]). 
The distance 

^(0.5) 

separates the samples better than the other 
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Fig. 5. Comparison of different UniFrac variants for clustering samples from 
smokers and non-smokers. Principle coordinate analysis is performed on 
the distance matrices of , d^^-^\ and d^^^ . The samples are plotted 
on the first two principle coordinates. The PERMANOVA P- values are 
also indicated in this figure. The ellipse center indicates groups means, its 
main axis corresponds to the first two principle components from principle 
component analysis and the height and width are variances on that direction. 

three distance measures. This indicates that smoking might affect 
not only the predominant lineages but also these less abundant 
lineages in the throat microbial community. We then perform 
Wilcoxon rank- sum or Fisher's exact test to select the differential 
OTUs. At a = 0.05 level, we identify 32 OTUs. These OTUs 
belong to genera Prevotella (8), Lachnospiraceae (5), Veillonella 
(3), Streptococcus (2), Fusobacterium (2), Treponema (2), Neisseria 
(1), Haemophilus (1), Megasphaera (1), Dialister (1), Moryella (1), 
Erysipelotrichaceae (1) and four genera from Actinobacteria. Most 
of the selected OTUs are moderately abundant or rare, so we expect 
to have better power. 



4 DISCUSSION 

Microbiome data are multivariate count data in their original 
form and are statistically challenging to analyze due to their 
high dimensionality, phylogenetic constraints among species/OTUs, 
overdispersion and excessive zeros. To circumvent the difficulty, 
the data are often summarized in the form of distance matrix. 
Testing association of microbiome composition with environmental 
covariates is performed using the distance matrix. We have 
demonstrated in simulations that the weighted and unweighted 
UniFrac impose large weight either to abundant lineages or to 
rare lineages; they can be underpowered in detecting change 
in moderately abundant lineages. Since microbiome composition 
change could occur in any lineages, our generalized UniFrac 
distances, which unify the weighted and unweighted UniFrac in 
a common framework, enable us to detect a much wider range 
of biologically relevant changes. Our simulation studies have 
clearly demonstrated that the generalized UniFrac distance 

^(0.5) 

is more robust than or , and its performances are in general 
comparable to the best UniFrac distances among the scenarios we 
considered. In addition, the generalized UniFrac distances are very 
robust to tree constructing methods. We suggest the use 



testing association of microbiome composition with environmental 
covariates to avoid missing important findings. 

Both weighte d and unweighted UniF' rac distances are sensitive to 
sampling depth (iLozupone et a/.lEoiOl) . Inflated distances at a low 
sampling depth are caused by sampling variation especially for the 
rare lineages. The generalized UniFrac distances are also sensitive to 
sampling depth ( [Supplementary Fig. S5\ . However, as the sampling 
depth increases, the distance stabilizes. For the gut microbiome 
dataset, we found a sequencing depth of '^lOOO reads is sufficient 
to stabilize the generalized UniFrac distances. To overcome the 
potential adverse effects of uneven sampling, rarefaction is usually 
used to subsample the samples to the same depth. However, when 
the sampling depth varies greatly across the samples, rarefaction 
throws away a significant portion of the 16S reads and increases 
the sampling variation. We found that rarefaction is not necessary, 
at least, in the context of testing the association of the microbiome 
composition with covariates ( [Supplementary Fig. S6| l. 

The power of UniFrac variants can also be compared in the 
context of testing whether two microbial communiti es differ 
significantly as in (IChang et ali l201ll : ISchlossL l2008h . Instead 
of comparing power for detecting the difference between two 
communities, we focus our evaluations on the performance 
of UniFrac distances for associating microbiome composition 
to environmental covariates by collecting multiple independent 
samples. The rational is that as the sequence depth increases, two 
sample comparison will have increased power to detect differences 
due to sources that we are not interested in (random noises), such 
as the individual-to-individual variability, day-to-day variability, 
sampling location variability or even technical variability (e.g. 
sample preparation). Multiple samples from a population coupled 
with multivariate statistical methods such as the distance-based 
PERMANOVA provide powerful design and analysis methods 
to ov ercome these potential random noises dLozupone et ail 
|201Q|) . As more and more large-scale microbiome datasets are 
being collected, we expect that our generalized UniFrac distances 
can help to identify important covariates that are associated 
with the microbiomes that could be missed using the commonly 
used UniFrac distances. In addition to identifying environmental 
covariates that may be determinants of microbiome composition, 
our approach would be equally suited to identifying microbiome 
features associated with biological or clinical outcomes, which is 
needed to begin to understand the impact of the microbiome on 
health. 
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